Modeling for Prediction of Mortality Based on past Medical History in Hospitalized COVID-19 Patients: A Secondary Analysis

Introduction Although COVID-19 is not currently a public health emergency, it will affect susceptible individuals in the post-COVID-19 era. Hence, the present study aimed to develop a model for Iranian patients to identify at-risk groups based on past medical history (PMHx) and some other factors affecting the death of patients hospitalized with COVID-19. Methods A secondary study was conducted with the existing data of hospitalized COVID-19 adult patients in the hospitals covered by Iran University of Medical Sciences. PMHx was extracted from the registered ICD-10 codes. Stepwise logistic regression was used to predict mortality by PMHx and background covariates such as intensive care unit (ICU) admission. Crude population attributable fraction (PAF) as well as crude and adjusted odds ratio (OR) with 95% confidence interval (CI) were reported. Results A total of 8879 patients were selected with 19.68% mortality. Infectious and parasitic diseases' history showed the greatest association (OR = 5.72, 95% CI: 4.20, 7.82), while the greatest PAF was for cardiovascular system diseases (20.46%). According to logistic regression modeling, the largest effect, other than ICU admission and age, was for history of infectious and parasitic diseases (OR = 3.089, 95% CI: 2.13, 4.47). A good performance was achieved (area under curve = 0.875). Conclusion Considering the prevalence of underlying diseases, many mortality cases of COVID-19 are attributable to the history of cardiovascular disease. Future studies are needed for policy making regarding reduction of COVID-19 mortality in susceptible groups in the post-COVID-19 era.


Introduction
COVID-19 has been the most significant challenge to our healthcare systems in the modern era. With about 7 million confirmed deaths among 771 million confirmed cases as of October 2023, it remains a challenge, with the number of cases still on the rise. The presentations of COVID-19 are heterogeneous, ranging from asymptomatic cases to patients transitioning from mild or moderate respiratory symptoms to severe pneumonia, respiratory distress, multiple organ dysfunction, need for mechanical ventilation (MV), intubation, and ultimately death [1]. Balancing the overall capacity of medical resources such as ICU beds against the number of patients most at risk of developing severe disease and death upon admission has placed pressure on the existing healthcare resources and underscored the need for accurate triage [2].
The most frequently documented predictors of severe prognosis in COVID-19 patients encompass age, gender, findings derived from computed tomography (CT) scans, C-reactive protein levels, lactic dehydrogenase levels, and lymphocyte count [3]. Furthermore, patients with underlying diseases experience more unfavorable outcomes than those without such conditions. COVID-19 patients with a history of hypertension, obesity, chronic lung disease, diabetes, or cardiovascular disease may experience deteriorating outcomes more often than others [4].
Models that take several characteristics into account to estimate the probability of a poor prognosis could aid clinicians in prioritizing patients when allocating limited healthcare resources. Prediction models can incorporate predictors to estimate the probability of specific outcomes, aiding in risk stratification and personalized patient management [3]. However, the existing prediction models require refinement and validation to ensure their accuracy and applicability in the clinical setting. In particular, due to methodological issues such as predictor bias, variability in defining COVID-19 cases (participants' bias), and evaluation of different outcomes (outcome measurement bias), a substantial proportion of the reported models are susceptible to a high risk of bias [3,5]. Although COVID-19 is not currently a public health emergency, it will affect susceptible individuals in the post-COVID-19 era. Therefore, there is a need for a statistical tool for the identification of susceptible individuals.
This study aims to develop a model for Iranian patients to identify at-risk groups based on the past medical history (PMHx), demographic variables, and some other clinical variables affecting the death of patients with COVID-19 hospitalized in the hospitals covered by the Iran University of Medical Sciences.

Study Design.
A secondary study was conducted on existing data to design an exploratory model for prediction of death in COVID-19 patients. The study protocol and report were based on the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement. This study consisted of an epidemiological study and statistical modeling. This study was approved by the Ethics Committee of Iran University of Medical Sciences with ethics registration number IR.IUMS.REC.1400.388 on 26 July 2021.

Source of Data.
A multicenter retrospective cohort registry known as the Medical Care Monitoring Center (MCMC), containing the data recorded from patients from the onset of the COVID-19 epidemic until April 2021, was used to develop the model. All the patients hospitalized due to COVID-19 with ICD-10 codes U07.1 and U07.2 in the mentioned time period were eligible for this study. Access to the data was under the legal supervision of the Center of Statistics and Information Technology, Iran University of Medical Sciences. The management of the center shared with the researchers an Excel file including all the variables of this study. In that dataset, the PMHxs were stored in one string variable containing ICD-10 codes. Hence, the data cleaning process included creating dummy variables for each PMHx, removing the observations with outlier ages, coding the text values of qualitative variables, and managing the missing data. Patients under 18 years, people with a negative test, and people with incompletely recorded information were excluded from this study.
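The dummy-variable step described above can be sketched as follows. This is only an illustration: the layout of the PMHx string (codes separated by punctuation or spaces), the function name `pmhx_dummies`, and the example codes are assumptions, not details taken from the registry. Only the AB and CD combinations come from the text.

```python
import re

# Chapter letters that are merged into one variable (stated in the text: AB and CD).
# Any other letter keeps its own name; this mapping is otherwise hypothetical.
COMBINED = {"A": "AB", "B": "AB", "C": "CD", "D": "CD"}


def pmhx_dummies(code_string, variables):
    """Return {variable: 0 or 1} for one patient's PMHx string of ICD-10 codes."""
    row = {name: 0 for name in variables}
    # An ICD-10 code is a letter followed by two digits and an optional
    # decimal part, e.g. I10 or E11.9.
    for code in re.findall(r"[A-Z]\d{2}(?:\.\d+)?", (code_string or "").upper()):
        letter = code[0]
        name = COMBINED.get(letter, letter)
        if name in row:  # categories outside the variable list are ignored
            row[name] = 1
    return row


# Hypothetical example: hypertension (I10), diabetes (E11.9), hepatitis (B18.2).
variables = ["AB", "CD", "G", "I", "K", "L", "M", "N"]
example = pmhx_dummies("I10; E11.9, b18.2", variables)
print(example)
```

In this sketch a missing or empty string yields all zeros, which treats "no recorded history" as "no history"; whether that matches the registry's semantics is an assumption.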

Outcome.
In this study, in-hospital mortality in patients with COVID-19 was considered as the main outcome. Therefore, hospitalized patients with COVID-19 who died due to the disease and had a positive CT scan or PCR test were included in the mortality group. Patients who recovered or were discharged from the hospital were included in the comparison group.

Predictors.
The variables of age, gender, ICU admission, marital status (being single compared with married, divorced, or widowed), length of stay (LOS), and PCR result were included in the analysis. The underlying diseases of the patients were extracted and categorized using ICD-10 codes, and binary variables were created from the categories. These codes indicated any history of these diseases that resulted in a hospital referral and hospitalization, and they were registered under the national ID of the patient at any time. Some ICD-10 codes were combined, and new variable names were created as follows; the other variable names were unchanged.
Infectious and parasitic diseases: AB. Hematologic disorders and neoplasms: CD.

Sample Size.
For the epidemiological study, the sample size was calculated for a Pearson chi-square test of association, assuming a 0.2 probability of positive exposure (allocation ratio), a 0.2 probability of positive outcome with a 0.05 difference in the positive exposure group, a 0.05 two-tailed alpha error, and 0.95 power; the required sample size was 6477. For statistical modeling, the powerlog package of Stata was used to estimate the sample size for logistic regression. Considering 0.2 and 0.25 outcome probabilities (the same assumptions as in the epidemiological study), a 0.05 two-tailed alpha error, 0.9 power, and a squared multiple correlation of 0.79 (equivalent to variance inflation factor (VIF) < 5), the required sample size was 4297, which could detect the significance of an odds ratio (OR) = 1.333.

Missing Data.
A complete-case analysis strategy was adopted, as it was assumed that the missing data were missing completely at random.

Statistical Analysis.
For the epidemiological study, descriptive statistics and crude tests (such as chi-square) were used. In addition, the OR with 95% confidence interval (CI) and the population attributable fraction (PAF) were reported based on direct calculation from two-by-two tables using the -cc- command of the software. For statistical modeling, multiple logistic regression was used with a backward Wald stepwise approach for model selection with a maximum P value of 0.1. Postestimation receiver operating characteristic (ROC) analysis was used to study model performance, and the area under the curve (AUC) was reported with 95% CI. In addition, for each predicted probability cutoff point of the model, sensitivity, specificity, correct classification, likelihood ratios (positive and negative), positive predictive value (PPV), negative predictive value (NPV), and Youden's J were reported. A nomogram was designed for practical use of the model. To evaluate the predictive ability of the estimated model, cross-validation was performed with the -cvauroc- package of the software. All the statistical processes of this study were conducted in Stata 17 (Stata Corp. LLC, TX, US).
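As a rough illustration of the performance measures named above, the AUC can be computed as the probability that a randomly chosen death receives a higher predicted probability than a randomly chosen survivor, and Youden's J as sensitivity plus specificity minus one at a given cutoff. The sketch below uses plain Python on made-up outcomes and probabilities; it is not the Stata postestimation routine used in the study, and the function names are hypothetical.

```python
def auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count one half.

    This equals the Mann-Whitney U statistic divided by n_pos * n_neg.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def indices_at_cutoff(y_true, scores, cutoff):
    """Sensitivity, specificity and Youden's J when score > cutoff is called positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s > cutoff)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s <= cutoff)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s <= cutoff)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s > cutoff)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, sens + spec - 1


# Made-up data for illustration only: 1 = died, 0 = survived.
y = [0, 0, 0, 0, 1, 0, 1, 1]
p = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

print("AUC:", round(auc(y, p), 4))
for cut in (0.3, 0.5):
    sens, spec, j = indices_at_cutoff(y, p, cut)
    print(cut, round(sens, 3), round(spec, 3), round(j, 3))
```

The pairwise comparison is quadratic in the number of patients, which is acceptable for a sketch but slower than a rank-based implementation on large registries.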

Sensitivity Analysis.
In order to account for model complexity, we conducted model selection based on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) using the -aic_model_selection- package of the software. The best models based on AIC and BIC were reported. Finally, the performances of all the models were compared as a sensitivity analysis, along with a cross-validation for each model.
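For reference, the two criteria in their standard form are given below, where \(k\) is the number of estimated parameters, \(n\) is the number of observations, and \(\hat{L}\) is the maximized likelihood. That the package uses exactly these definitions is an assumption.

```latex
\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
\]
```

With a sample of several thousand patients, \(\ln n\) is far larger than 2 (for example, \(\ln 8879 \approx 9.09\)), so BIC penalizes each additional covariate more heavily than AIC and tends to favor smaller models.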

Patients.
A total of 8879 patients were selected from MCMC according to the inclusion and exclusion criteria and after removal of missing data. At first, there were 9416 cases, of which 511 were removed due to discharge by personal consent, hospital transfer, or escape from the hospital. Another 25 cases were removed due to age less than 18 years. There was only one observation with a missing datum, which had no age. Therefore, it was considered a missing-at-random datum and removed from the analyses.

Epidemiological Study.
There was 19.68% mortality among these patients. Given the cross-sectional design of this study, this outcome prevalence might be representative of the population in the studied place and time period. Therefore, all the statistical results were based on this 19.68% prior probability. About 54.98% of the cases were 60 years old or more. Demographic characteristics of the participants and the crude associations are shown in Table 1. In brief, infectious and parasitic diseases showed the greatest association (OR = 5.72, 95% CI: 4.20, 7.82), while the greatest PAF was for cardiovascular system diseases, meaning that 20.46% of COVID-19 deaths were attributable to cardiovascular system diseases. The chart of PAFs is shown in Figure 1. In this chart, two conditions can be considered ideal: (1) the PAF is similar to the proportion of dead cases who had the mentioned disease (called sensitivity in the chart), which would mean that nearly all deaths among patients with the disease were attributable to it, and both values are larger than those of the other diseases; and (2) the PAF is considerable even though it differs from the sensitivity. Regarding the first condition, infectious and parasitic diseases seemed to be the best variable, as they also showed the greatest association. Regarding the second condition, diseases of the cardiovascular system showed a considerable PAF.
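The relation between OR, sensitivity, and PAF discussed above can be made concrete with a small sketch. It assumes the case-control-style estimator PAF = (proportion of deaths that were exposed) × (OR − 1)/OR and a Woolf (log-based) confidence interval; the software may use a different interval method, and the counts below are hypothetical, not values from Table 1.

```python
import math


def two_by_two(a, b, c, d, z=1.96):
    """Crude OR, Woolf 95% CI and PAF from a two-by-two table.

    a: deaths with the disease history     b: deaths without it
    c: survivors with the disease history  d: survivors without it
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Woolf standard error
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    sensitivity = a / (a + b)  # proportion of deaths that were exposed
    # Attributable fraction among the exposed times exposure among deaths.
    # A negative value (OR < 1) corresponds to a preventive fraction.
    paf = sensitivity * (odds_ratio - 1) / odds_ratio
    return {"or": odds_ratio, "ci": (lower, upper),
            "sensitivity": sensitivity, "paf": paf}


# Hypothetical counts for illustration only.
result = two_by_two(a=60, b=140, c=40, d=760)
print(result)
```

Because (OR − 1)/OR approaches one as the OR grows, the PAF approaches the sensitivity for strongly associated diseases, while a prevalent disease with a modest OR can still produce a large PAF.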

Statistical Modeling.
Logistic regression modeling was used for prediction of death in hospitalized COVID-19 patients. With all covariates entered into the model (enter method), no pairwise collinearity (largest r = 0.468) and no multicollinearity (largest VIF = 1.30) were observed. As the modeling strategy was predictive, stepwise model selection with the backward Wald method was used. According to the stepwise process of model selection, 12 covariates could significantly predict the outcome (P < 0.1). The largest effect, other than ICU admission and age, was for the history of infectious and parasitic diseases (OR = 3.089, 95% CI: 2.13, 4.47) (Table 2).
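For readers less familiar with the notation, the general form of a logistic regression model and its link to the reported odds ratios is sketched below. The symbols \(\beta_0\), \(\beta_j\), and \(x_j\) are generic; the fitted coefficients themselves are not reproduced here.

```latex
\[
\ln\frac{p}{1-p} = \beta_0 + \sum_{j}\beta_j x_j,
\qquad
p = \frac{1}{1+\exp\!\bigl(-(\beta_0 + \sum_{j}\beta_j x_j)\bigr)},
\qquad
\mathrm{OR}_j = e^{\beta_j}
\]
```

Under this form, an adjusted OR of 3.089 corresponds to a coefficient of \(\ln 3.089 \approx 1.128\) on the log-odds scale.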
This model showed a suitable goodness-of-fit considering all the observed covariate patterns (P = 0.152, chi-square = 388.61, and degrees of freedom = 361). However, the Hosmer and Lemeshow test showed a lack of goodness-of-fit for most of the quantile groupings (P = 0.012 for ten quantiles; the largest P = 0.251 for three quantiles). In addition, the model performance was good (AUC = 0.875, 95% CI: 0.866, 0.883) (Figure 2). The result of cross-validation showed a similar performance (AUC = 0.873, standard deviation = 0.011).

Canadian Journal of Infectious Diseases and Medical Microbiology
Considering a predicted probability of more than 0.5 as the predicted positive outcome, the model sensitivity was 50.83%, the model specificity was 91.38%, and the correct classification rate was 83.40%. Considering the observed base rate of this study (19.68%), the positive and negative predictive values were 59.08% and 88.35%, respectively. The results for other cutoff points of the predicted probability are shown in Table 3. The highest correct classification rate was for a predicted probability greater than 0.5, while the largest Youden's J was for a predicted probability greater than 0.3. Therefore, considering the observed prevalence of mortality (19.68%), the best cutoff point was 0.5, while without considering this prevalence, the best cutoff point was 0.3 (as correct classification is influenced by prior probability, while Youden's J is not). Although the specificities were higher than the sensitivities, the NPVs were higher than the PPVs due to the low prior probability. Therefore, this model was better for ruling out the high-risk cases. This model can be used through both the regression equation and the nomogram (Figure 3). In the nomogram, each covariate has a scoring system. The total score of all covariates for each covariate pattern can be converted to a probability. As a practical example, a female patient (zero points) older than 60 years (about 4.5 points) who is admitted to the ICU (about 10 points) without any underlying disease (zero points) has about a 0.4 probability of death, considering a total score of 14.5.
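The dependence of the predictive values on the prior probability can be checked with a few lines of arithmetic. The sketch below applies Bayes' rule to the sensitivity, specificity, and base rate reported above for the 0.5 cutoff; the function name and the alternative prevalence of 0.50 are illustrative assumptions.

```python
def predictive_values(sens, spec, prev):
    """Diagnostic indices implied by sensitivity, specificity and prevalence."""
    tp = sens * prev                # true positives per patient
    fn = (1 - sens) * prev          # false negatives
    tn = spec * (1 - prev)          # true negatives
    fp = (1 - spec) * (1 - prev)    # false positives
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": tp + tn,            # correct classification rate
        "youden_j": sens + spec - 1,    # does not involve prevalence
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
    }


# Values reported in the text for predicted probability > 0.5.
observed = predictive_values(sens=0.5083, spec=0.9138, prev=0.1968)
# Hypothetical setting with a much higher prior probability of death.
high_prev = predictive_values(sens=0.5083, spec=0.9138, prev=0.50)

for name in ("ppv", "npv", "accuracy", "youden_j"):
    print(name, round(observed[name], 4), round(high_prev[name], 4))
```

With the observed base rate, the sketch gives a PPV, NPV, and correct classification rate that agree with the reported 59.08%, 88.35%, and 83.40% up to rounding, while Youden's J is identical under both prevalences.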

Sensitivity Analysis.
The final model was used as the starting point for model selection based on AIC and BIC. For each scenario of model selection, AUC and cross-validation results were reported. The AIC scenario selected a model similar to the full entry model (AUC = 0.875, cross-validation range: 0.862-0.890). The BIC scenario selected a model with four fewer covariates (AUC = 0.873, cross-validation range: 0.859-0.888) (supplementary table S1).

Discussion
The main objective of this study was to develop a model for Iranian patients to identify at-risk groups based on the PMHxs affecting the death of COVID-19 hospitalized patients. In this regard, we developed a predictive model for COVID-19 mortality using the real data obtained from MCMC. This model utilizes logistic regression to assess the risk of hospital mortality among COVID-19 patients based on their demographic information and comorbidities. We prioritized risk factors using the PAF index. Overall, the practical goal of this study was to present a user-friendly model to clinicians to help at-risk patients.
Our results indicated that ICU admission was significantly associated with an increased probability of mortality. However, it is important to note that ICU admission is more of a proxy for severity of illness and should be used primarily for prediction purposes rather than causal inference. From our analysis of demographic data and comorbidities, we found that a history of infectious and parasitic diseases had the highest association with mortality, followed by malignancies and hematologic disorders, neurologic disorders, genitourinary diseases, cardiovascular diseases, and respiratory disorders, considering the crude (unadjusted) analysis. It is essential to clarify that these associations represent the strength of the relationship, not the prevalence of the risk factors. Therefore, we used the PAF index to make the results more practical for the population.
In our multivariable regression, after adjusting for various factors, ICU admission and patient age ≥ 60 remained strong predictors of mortality. In addition, infectious and parasitic diseases, hematologic disorders and malignancies, and skin disorders emerged as the most significant comorbidities associated with increased mortality risk. The difference between crude and multivariable analyses can be attributed to the role of each risk factor in predicting mortality. Given the study's purpose, covariates are not necessarily considered confounding variables, and the study is more suitable for prediction than for establishing causality. Among the comorbidities, cardiovascular disorders, malignancies, endocrine disease, and infectious and parasitic diseases had the highest PAF values, respectively. However, the prevalence of exposure was highest for cardiovascular disorders, endocrine disease, and malignancies, respectively. A PAF that was low relative to the exposure prevalence was due to a high prevalence of the disease, such as endocrine disease, combined with a weak association with mortality (high prevalence and low OR), and vice versa. Although some diseases such as infectious and parasitic diseases were strongly associated with mortality, they were not prevalent in the hospitalized patients (low prevalence and high OR). The differences between PAF and exposure prevalence are shown for each PMHx in Figure 1. Also, a high PAF alongside a high exposure prevalence represents a significant association with mortality. For instance, cardiovascular disorders and malignancies, being prevalent in the study population, play critical roles in COVID-19 mortality. On the other hand, less prevalent diseases such as infectious and parasitic diseases exhibit low PAF values despite their strong association with mortality. Health policymakers can utilize this approach of using the PAF index to prioritize risk factors in the population.

Figure 2: (a) ROC curve and (b) sensitivity/specificity plot.
In concluding this section, to make practical use of this predictive model for mortality, variables with both high OR and high PAF should be selected. Such factors are significantly associated with mortality, are prevalent in the population, and have a high PAF. For example, cardiovascular disorders meet these criteria. Variables with high OR are better suited for personal decision-making, while PAF is more useful for predicting mortality at the population level. This difference arises because PAF depends on the prevalence of exposed cases in the population, whereas the OR depends on a strong association between individual comorbidities and mortality. With an AUC of 0.875 and a probability threshold of >0.5 as a positive outcome, this model demonstrates good performance. It can be applied to assess an individual's risk of mortality based on their comorbidities.
Previously, some other studies reported the PAF of different underlying diseases for COVID-19 mortality. In general, 35.7% of COVID-19-related deaths at all ages are attributable to chronic diseases [6]. Nguyen et al. studied the PAF of different underlying medical conditions in 87,526 hospitalized cases of COVID-19 in the US. Similar to our study, they found the highest PAF for cardiovascular diseases, as it was responsible for 45% of COVID-19 deaths [7]. However, the present study was performed in a more homogeneous population (all patients were Iranian, while the study of Nguyen et al. included different races) and found a PAF of 20.46%.
Several recent studies have also developed prediction models for COVID-19 mortality. For example, Hajifathalian et al., in a study on 664 patients in the US, used logistic regression to show that age, ethnicity, hypertension, cardiovascular disorders, chronic kidney disease, and several other factors were significantly associated with 14-day mortality. Their model achieved an AUC of 0.86 for seven-day mortality and 0.83 for 14-day mortality [8]. The present study showed a similar performance (AUC = 0.875) with a larger sample size; however, the models' content was not similar, as they had an exploratory approach. In other words, such modeling approaches on this topic would result in similar AUCs in the approximate range of 0.8-0.9. Berenguer et al. developed a prediction model with an AUC of 0.82 for 30-day mortality using age, low age-adjusted SaO2, neutrophil-to-lymphocyte ratio, eGFR by the CKD-EPI equation, dyspnea, and sex. In addition, comorbidities like hypertension, obesity, liver cirrhosis, chronic neurological disorders, active neoplasia, and dementia were associated with increased risk of COVID-19 mortality; however, they were not included in this model. Moreover, the strongest association with increased mortality risk was observed in higher age categories (OR = 56.3 for age > 90 years versus OR = 1 for age < 40 years). Notably, this study population had a higher median age (70 years) than our study (median age = 61) [9]. In spite of differences in approaches and model covariates, such studies had an approximately similar model performance considering logistic regression modeling.
However, some studies achieved better model performance and accuracy using modeling methods other than logistic regression. In the study of Gao et al. (2020), a highly accurate (AUC ranging from 0.9186 to 0.9762) ensemble model was developed using logistic regression, support vector machine (SVM), gradient boosted decision tree (GBDT), and neural network (NN), and it was validated both internally and externally. Eight features, including consciousness, male sex, sputum, blood urea nitrogen (BUN), respiratory rate (RR), D-dimer, number of comorbidities, and age, were found to be strong risk factors for mortality. In the mentioned study, considering a threshold of 0.6, the AUC ranged from 0.92 to 0.97 for the logistic regression model, with an accuracy of 87.1% to 95.4% [10]. The role of comorbidities was better defined in a Cox proportional hazards model and a logistic regression model built by Moon et al. for predicting 30-day and 60-day mortalities (AUC = 0.959). Diabetes mellitus, cancer, and dementia as underlying diseases were significantly associated with 30-day and 60-day COVID-19 mortalities, which supports our findings. Moreover, age ≥ 70, male sex, and the presence of fever and dyspnea at the time of the COVID-19 diagnosis were reported as significant risk factors in this study [11]. Importantly, a statistically inspired modification of the partial least squares (SIMPLS)-based model with high accuracy (AUC of 0.91 to 0.95, Q2 = 0.24) found that coronary artery disease (CAD) has the highest predictive value for in-hospital mortality. Similar to other studies, diabetes, age > 65, altered mental status, dementia, and SaO2 < 88% were the other important risk factors [12]. Some other machine learning models, including deep neural networks (DNN), random forest classifier (RF), eXtreme gradient boosting classifier (XGB), and SVM, were utilized in a prediction model by Wan et al. In the mentioned study, age, income/personal property, long-standing illness, disability, and heart disorders were significantly associated with death in COVID-19 patients [13].
This study had some limitations. First of all, it was a secondary study on a patient registry, and data gathering was beyond our control. However, the absence of human interference in data collection, the large sample size, and the multicenter data were advantages of this data collection method. In addition, the selected variables were well-defined features, which reduced data collection bias. There were some concerns about overcontrol bias related to adding ICU admission to the model, which impeded causal inference. It should be noted that there is no causal relation between ICU admission and death, and therefore, it may not be an intermediate in a causal chain. Given the predictive approach of this model selection strategy, there is no concern about using this variable. Another limitation of logistic regression models is that they show predictors regardless of a low PAF, so they would not be useful enough for public health. Hence, we reported the PAFs as well. The last limitation was not using train and test validation. Instead, we preferred to use all the available data as the main sample for analysis along with cross-validation. The strength of this study was the low risk of overfitting in comparison to more complex models. Using logistic regression instead of advanced machine learning techniques, together with a large sample size, might reduce the risk of overfitting. However, P-hacking in the stepwise process was inevitable.

Conclusions
In conclusion, age ≥ 60, male sex, single marital status, comorbidities (including infectious and parasitic, hematologic and malignancies, neurologic, cardiovascular, gastrointestinal, skin and subcutaneous, musculoskeletal and connective tissue, and urogenital), and ICU admission are important predictors of mortality, with an AUC of more than 0.85. Although a variety of advanced machine learning models exist, logistic regression is user friendly and easy to interpret, which matters because our audience was clinicians. This study can be used in personal and public clinical decision-making to reduce COVID-19 mortality. Future studies are needed for policy making regarding reduction of COVID-19 mortality in susceptible groups in the post-COVID-19 era. This consists of developing guidelines and policy briefs for prevention and treatment of COVID-19 in specific groups.

Participants.
This study includes patients admitted to the affiliated hospitals of Iran University of Medical Sciences. Patients with COVID-19 are included in the study based on a positive polymerase chain reaction (PCR) test and/or involvement in a pulmonary computed tomography (CT) scan.

Figure 1: Comparison of exposure prevalence (equivalent to sensitivity in diagnostic tables) and PAF for each PMHx among the mortality cases. The amounts are in percentage, and the negative PAFs indicate preventive fraction. PAF: population attributable fraction; PMHx: past medical history.

Figure 3: Nomogram for prediction of death probability based on logistic regression (Table 2). AB: infectious and parasitic diseases (combination of A or B ICD-10 codes), CD: hematologic disorders and neoplasms (combination of C or D ICD-10 codes), G: diseases of the nervous system, I: diseases of circulatory (cardiovascular) system, K: gastrointestinal diseases, L: skin and cutaneous tissue diseases, M: diseases of musculoskeletal system and connective tissue, and N: diseases of genitourinary system. All the covariates are binary with acceptable amounts of zero (negative) and one (positive).

Table 1: Characteristics of the COVID-19 patients and association with mortality.
*Reporting mean and SD according to approximately normal distribution; **reporting median and IQR according to the right-skewness of the distribution; SD: standard deviation; IQR: interquartile range; T: independent t test for variables with normal distribution; U: Mann-Whitney U test for variables with skewed distribution; chi: Pearson chi-square for comparison of qualitative variables; OR: odds ratio; CI: confidence interval; PAF: population attributable fraction (this is a preventive fraction in cases with OR < 1); ICU: intensive care unit; LOS: length of stay; PCR: polymerase chain reaction.

Table 2: Multiple logistic regression model for prediction of death in the hospitalized COVID-19 patients.

Table 3: Diagnostic accuracy indices and their normal 95% CIs for prediction of death for each cut-point of the predicted probability.
Berenguer et al. developed a prediction model with an AUC